Shakespeare & Model Robustness

Selecting which texts belong in the corpus you plan to analyze (and which ones don't!) is a major interpretive problem. This process of selection is closely tied to the definition of our research question. At the same time, when we seek out patterns across texts, our scholarly arguments are strongest when they hold true across reasonable variations in selection criteria.

For example, say that we perform a distant reading of Shakespeare's Comedies. We analyze them computationally and present our findings to a scholarly audience. However, during Q&A, an objection is raised that the late romances like The Tempest and the problem plays like Measure for Measure are having a disproportionate impact on the patterns we have discovered. Their status as comedy is subject to debate. Clearly our claim about the Comedies as a whole has been invalidated!

We can anticipate this kind of objection by testing variations in our selection criteria. If a pattern holds true across variations that reflect scholarly debates about the categories themselves, then it offers a strong argumentative foothold. Alternatively, if the linguistic pattern changes with variant corpora, then it offers a wider view of the discursive field.

Distinctive Words

In this set of exercises, we will identify words that are distinctive of Shakespeare's Comedies, as opposed to the Tragedies and Histories. The corpus for this task is the set of Shakespeare's plays, stripped of all character names and stage directions. Only dialogue remains. These have been made available by Michael Witmore from the Folger Digital Texts collection.

First, we will perform our distinctive word test using the three genres as they are assigned to the plays in the First Folio.

COMEDY
- The Tempest, The Two Gentlemen of Verona, The Merry Wives of Windsor, Measure for Measure, The Comedy of Errors, Much Ado About Nothing, Love's Labour's Lost, A Midsummer Night's Dream, The Merchant of Venice, As You Like It, The Taming of the Shrew, All's Well That Ends Well, Twelfth Night, The Winter's Tale
HISTORY
- King John, Richard II, Henry IV-Part 1, Henry IV-Part 2, Henry V, Henry VI-Part 1, Henry VI-Part 2, Henry VI-Part 3, Richard III, Henry VIII
TRAGEDY
- Troilus and Cressida, Coriolanus, Titus Andronicus, Romeo and Juliet, Timon of Athens, Julius Caesar, Macbeth, Hamlet, King Lear, Othello, Antony and Cleopatra, Cymbeline

Second, we will repeat the process using slightly different categories. In addition to COMEDY, HISTORY, and TRAGEDY, we will include ROMANCE and PROBLEM. Several plays will be shifted into these latter, contested categories.

ROMANCE
- Pericles, Cymbeline, The Winter's Tale, The Tempest, Two Noble Kinsmen
PROBLEM
- All's Well That Ends Well, Measure for Measure, Troilus and Cressida

Exercise 1

Read the text of Shakespeare's plays from files
Create a DataFrame with 4 columns
- Filename
- Genre as assigned in the First Folio
- Genre as revised to include Romances and Problem Plays
- Text

Note: Pericles and Two Noble Kinsmen were not included in the First Folio, though both have been argued to be romances. How will you handle this in your labeling?



In [ ]:

    
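import os

# Directory that holds one plain-text file per play
corpus_dir = "corpora/FDT Shakespeare Stripped/"

# Get a sorted list of filenames so the order is reproducible
filenames = sorted(os.listdir(corpus_dir))

# Read each file as text and close it afterwards
# (UTF-8 is an assumption about how the files are encoded)
texts = []
for filename in filenames:
    with open(os.path.join(corpus_dir, filename), encoding="utf-8") as f:
        texts.append(f.read())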
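One possible way to build the four-column DataFrame is sketched below. It is only an illustration: the filenames and texts are placeholders (the real corpus files may be named differently), the two dictionaries list just a handful of plays, and leaving the First Folio genre missing for Pericles is one choice among several.

```python
import pandas as pd

# Hypothetical filenames and placeholder texts standing in for the
# `filenames` and `texts` lists read from disk.
filenames = ["The Tempest.txt", "Hamlet.txt", "Measure for Measure.txt", "Pericles.txt"]
texts = ["placeholder one", "placeholder two", "placeholder three", "placeholder four"]

# First Folio genre for each file (only a few entries shown here).
folio_genre = {
    "The Tempest.txt": "COMEDY",
    "Hamlet.txt": "TRAGEDY",
    "Measure for Measure.txt": "COMEDY",
}

# Plays whose label changes once ROMANCE and PROBLEM are added.
revised_overrides = {
    "The Tempest.txt": "ROMANCE",
    "Measure for Measure.txt": "PROBLEM",
    "Pericles.txt": "ROMANCE",
}

df = pd.DataFrame({"filename": filenames, "text": texts})

# Plays absent from the First Folio get a missing value (NaN).
df["folio_genre"] = df["filename"].map(folio_genre)

# Start from the overrides and fall back to the Folio label elsewhere.
df["revised_genre"] = df["filename"].map(revised_overrides).fillna(df["folio_genre"])

# Put the columns in the order the exercise lists them.
df = df[["filename", "folio_genre", "revised_genre", "text"]]
```

Keeping the revised labels as a small set of overrides makes it easy to see exactly which plays moved, and to try other groupings later by editing one dictionary.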



In [ ]:

Exercise 2

Transform the texts of the plays into a document-term matrix (DTM), using tf-idf weighting
Create a new DataFrame with the First Folio genres and the DTM
Produce a list of distinctive words belonging to Comedies



In [ ]:

Exercise 3

Create a new DataFrame with the revised genres and the DTM
Produce a list of distinctive words belonging to Comedies
Compare results with those from Exercise 2
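For the comparison step, a small helper can summarize how far two distinctive-word lists agree. The function name `compare_word_lists` is hypothetical, and the words below are placeholders rather than actual results from the corpus.

```python
def compare_word_lists(first, second):
    """Summarize the overlap between two ranked word lists."""
    first_set, second_set = set(first), set(second)
    union = first_set | second_set
    return {
        # Words in both lists, in the order of the first list.
        "shared": [w for w in first if w in second_set],
        # Words that appear in only one of the two lists.
        "only_first": [w for w in first if w not in second_set],
        "only_second": [w for w in second if w not in first_set],
        # Jaccard similarity: shared words divided by all words seen.
        "jaccard": len(first_set & second_set) / len(union) if union else 1.0,
    }


# Placeholder lists standing in for the Exercise 2 and Exercise 3 output.
folio_words = ["alpha", "beta", "gamma", "delta"]
revised_words = ["beta", "gamma", "epsilon", "zeta"]

summary = compare_word_lists(folio_words, revised_words)
```

A high Jaccard score suggests the pattern is robust to the relabeling; the words in `only_first` are the ones that depended on the contested plays being counted as Comedies.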



In [ ]:

Bonus Exercise

Train a classifier to distinguish between Comedy and Tragedy (in the First Folio)
Predict whether Pericles and Two Noble Kinsmen belong to either group
- How confident are the predictions?
Produce a list of the most important features in the model



In [ ]: